Self-service data platform

ABSTRACT

Disclosed embodiments include a method performed by server computer(s). The method includes receiving a query and defining a query plan based on the received query. The query plan refers to datasets contained in data sources. The method further includes determining that the received query can be accelerated based on an optimized data structure contained in a memory, where the optimized data structure is derived from a dataset referred to in the query plan. The method further includes modifying the query plan to include the optimized data structure, and executing the modified query plan to obtain query results that satisfy the received query by reading the optimized data structure in lieu of reading at least some data from the data sources.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. provisional patent application Ser. No. 62/354,268, filed Jun. 24, 2016, which application is incorporated herein in its entirety by this reference thereto.

TECHNICAL FIELD

The disclosed teachings relate to a data platform and, more particularly, the disclosed teachings relate to a self-service data platform that enables users to discover, curate, accelerate, and analyze data from one or more data sources.

BACKGROUND

Conventional data analytics systems can collect, analyze, and act upon data contained in data sources. The data sources can be computing devices that are internal, external, local, or remote relative to the data analytics system. For example, an external remote data source can be a server connected over a computer network to the data analytics system. Existing data analytics systems have many drawbacks. They are designed for use exclusively by information technology (IT) professionals and not end-users. The systems are burdened by using extract, transform, and load (ETL) pipelines to pull data from the data sources and store the pulled data to a centralized data warehouse or data lake. These systems are inadequate because they offer only partial and stale data for querying and analysis.

Analysts typically spend significant amounts of time collecting and preparing data rather than actually analyzing the data with business intelligence (BI) tools. Examples of BI tools that have analytics or visualization capabilities include TABLEAU, POWER BI, R, or PYTHON. These tools operate primarily on data that resides in a single, small relational database. However, modern organizations use non-relational data sources such as HADOOP, cloud storage (e.g., S3, MICROSOFT AZURE BLOB STORAGE) and NOSQL databases (e.g., MONGODB, ELASTICSEARCH, CASSANDRA).

In addition, data is often distributed across disparate data sources such that a user cannot simply connect a BI tool to any combination of data sources. A connection mechanism is often too slow, queries often fail, volumes of raw data are too large or complex, and data are often of mixed types. Further, users seeking flexible access to data analytics systems oftentimes circumvent security measures by downloading or extracting data into unsecure, ungoverned systems such as spreadsheets, standalone databases, and BI servers for subsequent analysis. Accordingly, users seek capabilities to access, explore, and analyze large volumes of mixed data from distributed data sources without being burdened by rigid data analytics systems available mainly to IT professionals.

SUMMARY

The disclosed embodiments include a method performed by server computer(s). The method includes receiving a query and defining a query plan based on the received query. The query plan refers to datasets contained in data sources. The method further includes determining that the received query can be accelerated based on an optimized data structure contained in a memory, where the optimized data structure is derived from a dataset referred to in the query plan. The method further includes modifying the query plan to include the optimized data structure, and executing the modified query plan to obtain query results that satisfy the received query by reading the optimized data structure in lieu of reading at least some data from the data sources.

In some embodiments, the method further includes, prior to receiving the query, generating the optimized data structure to include raw data of at least one of the datasets, generating the optimized data structure to include an aggregation of data column(s) of at least one of the datasets, generating the optimized data structure to include at least one of sorted, partitioned, or distributed data of data column(s) of at least one of the datasets, and/or generating the optimized data structure to include data sampled from at least one of the datasets.

In some embodiments, the received query is a second query and the query results are second query results. The method further includes, prior to receiving the second query, generating the optimized data structure based on first query results that satisfy a first query. In some embodiments, the query plan is a second query plan, and a first query plan is defined to have a scope broader than necessary for obtaining query results satisfying the first query such that the generated optimized data structure is broader than an optimized data structure generated based on a query plan having a scope that is minimally sufficient for obtaining query results satisfying the first query.

In some embodiments, the query results are obtained without reading any of the datasets contained in the data sources or by reading at least some of the datasets contained in the data sources in addition to reading the optimized data structure.

In some embodiments, the method further includes autonomously deciding to generate the optimized data structure prior to determining that the received query can be accelerated. In some embodiments, the decision to generate the optimized data structure is based on a history of queries received by the server computer(s) and/or based on a determination that reading the optimized data structure in lieu of reading the at least some data from the data sources improves processing of an expected workload.

In some embodiments, the method further includes, prior to receiving the query, receiving user input requesting acceleration of queries on dataset(s) of the datasets and generating the optimized data structure in response to the received request.

In some embodiments, the method further includes, prior to receiving the query, receiving user input defining a virtual dataset derived from a physical dataset contained in the data sources, where the datasets include the virtual dataset.

In some embodiments, the modified query plan is only executed by a distributed query engine of the server computer(s).

The disclosed embodiments include a computer system. The computer system includes a processor and memory containing instructions that, when executed by the processor, cause the computer system to connect to data sources that contain physical datasets, cause display of a visual dataset editor, and allow users to curate data by using the visual dataset editor to create virtual datasets derived from the physical datasets without creating any physical copies of the curated data.

In some embodiments, the virtual datasets are exposed as tables in client applications. In some embodiments, the computer system is further caused to allow the users to share the virtual datasets via the visual dataset editor.

In some embodiments, the visual dataset editor includes a control that upon being selected by a user causes the computer system to open a client application connected to a virtual dataset.

In some embodiments, the computer system is further caused to display a visualization indicative of relationships between physical datasets and virtual datasets.

In some embodiments, the computer system is further caused to autonomously decide to generate an optimized data structure based on a physical dataset contained in the data sources, and store the optimized data structure in the memory, where the optimized data structure accelerates execution of a query referring to the physical dataset or a virtual dataset derived from the physical dataset.

Other aspects of the technique will be apparent from the accompanying Figures and Detailed Description.

This Summary is provided to introduce a selection of concepts in a simplified form that is further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A through 1C are block diagrams illustrating the evolution from IT-centric to self-service data analytics systems according to some embodiments of the present disclosure;

FIG. 2 is a block diagram illustrating features of a self-service data platform for performing data analytics according to some embodiments of the present disclosure;

FIG. 3 is a block diagram illustrating a high-level dataflow for a self-service data platform according to some embodiments of the present disclosure;

FIG. 4 is a diagram illustrating relationships between queries, virtual datasets, and physical datasets according to some embodiments of the present disclosure;

FIG. 5 is a flowchart illustrating processes of a self-service platform according to some embodiments of the present disclosure;

FIG. 6 is a block diagram illustrating an acceleration system of a self-service data platform according to some embodiments of the present disclosure;

FIG. 7A illustrates a view of a display for interacting with a self-service data platform according to some embodiments of the present disclosure;

FIG. 7B illustrates another view of a display for interacting with a self-service data platform according to some embodiments of the present disclosure;

FIG. 8 is a flowchart illustrating a process for accelerating a query execution process according to some embodiments of the present disclosure; and

FIG. 9 is a diagrammatic representation of a computer system which can implement some embodiments of the present disclosure.

DETAILED DESCRIPTION

The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments, and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts that are not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.

The purpose of terminology used herein is only for describing embodiments and is not intended to limit the scope of the disclosure. Where context permits, words using the singular or plural form may also include the plural or singular form, respectively.

As used herein, unless specifically stated otherwise, terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” “generating,” or the like, refer to actions and processes of a computer or similar electronic computing device that manipulates and transforms data represented as physical (electronic) quantities within the computer's memory or registers into other data similarly represented as physical quantities within the computer's memory, registers, or other such storage medium, transmission, or display devices.

As used herein, terms such as “connected,” “coupled,” or the like, may refer to any connection or coupling, either direct or indirect, between two or more elements. The coupling or connection between the elements can be physical, logical, or a combination thereof.

Self-Service Data Platform

FIGS. 1A through 1C are block diagrams illustrating the evolution of data analytics systems from IT-centric to self-service systems according to some embodiments of the present disclosure. In FIG. 1A, an IT-centric architecture includes an extract, transform, and load (ETL) tool operable to pull data from data source(s) and store the pulled data in a data warehouse. A business intelligence (BI) tool can be used to query the data warehouse. In FIG. 1B, an intermediate architecture is modified from the IT-centric architecture of FIG. 1A to include a data warehouse or alternative storage such as HADOOP or cloud storage. The intermediate architecture includes an ETL tool but offers a self-service BI tool for end-users rather than a BI tool exclusively for IT professionals. Lastly, FIG. 1C shows a self-service architecture modified from the intermediate architecture of FIG. 1B by replacing the ETL tool and data warehouse or alternative storage with a self-service data platform.

Disclosed herein are embodiments of the self-service data platform (“the platform”), which has self-service analytics capabilities for use in diverse environments. The platform enables entities (e.g., organizations, users, analysts, data scientists) to discover, curate, explore, and analyze diverse data from diverse data sources at any time and avoids the need to spend excessive time collecting or preparing data. For example, FIG. 2 is a block diagram illustrating features of the platform for performing data analytics according to some embodiments of the present disclosure.

As shown in FIG. 2, the platform is coupled between numerous data sources and analysis or visualization tools. In this example, the platform operates on or is coupled to 1-1,000 servers connecting the analysis and visualization tools to the numerous data sources. Examples of the data sources include NOSQL sources (e.g., MONGODB, HBASE, CASSANDRA), search sources (e.g., ELASTICSEARCH, SOLR), file storage sources (e.g., HDFS, NAS, LINUX CLUSTER, EXCEL/CSV), cloud computing (e.g., IaaS, PaaS, SaaS) sources (e.g., AMAZON S3), relational sources (e.g., MySQL, SQL server), and SaaS sources (e.g., SALESFORCE, MARKETO, GOOGLE SHEETS, and GITHUB). Examples of analysis or visualization tools include an analyst center (e.g., self-service portal), BI tools (e.g., QLIK SENSE, TABLEAU, POWER BI) via ODBC/JDBC, data science tools (e.g., R, PYTHON) and custom applications via REST. As such, users of the analysis or visualization tools can readily query and analyze data from the numerous data sources.

FIG. 3 is a block diagram illustrating a high-level dataflow 300 for the platform according to some embodiments of the present disclosure. The dataflow commences with a connection process 302 connecting the platform to data source(s) from which data can be obtained. The obtained data undergoes a preparation (i.e., curation) process 304. In some instances, the obtained data can undergo manage or explore processes that allow a user to manage, explore, and/or share the prepared data. In addition, the platform can create optimized data structures based on the obtained data. A description of optimized data structures is provided further below. The user can use manage and explore processes 304 to edit the optimized data structures in an effort to improve subsequent query executions. The prepared data can then undergo an analysis or visualization process 306 in response to a query execution. For example, a BI tool can be used to visualize queried data. In some instances, a subsequent query execution can undergo an acceleration process 308 to rapidly obtain query results based on the optimized data structures. The dataflow can cycle between the prepare process 304, analyze or visualize process 306, and acceleration process 308 as needed to optimize the outcome of data analytics performed by the platform.

The self-service features of the platform can improve user experience. Examples of self-service features involve data management and preparation, integration with diverse data sources, handling dynamic schemas, dataset namespace and path structures, exposing dataset information, data intelligence, user-defined data security, an autonomous memory for accelerating query executions, and a BI tool launcher. The self-service features of the platform are described in greater detail below.

The platform can process a variety of data types from a variety of data sources. For example, the platform can connect to non-relational data sources, relational databases, data warehouses, and spreadsheets to gather data in response to a query. For example, the platform can connect to data sources that traditionally could not be queried including NOSQL databases (e.g., MONGODB, ELASTICSEARCH, HBASE), cloud storage (e.g., AMAZON S3, AZURE BLOB STORAGE, GOOGLE CLOUD STORAGE), and HADOOP (e.g., HDFS, MAPR-FS). The platform can connect to a combination of these data sources, and simultaneously or asynchronously query data from across these data sources.

The platform may have flexible data connection capabilities. It does not require defining a schema or a data model, or performing ETL on data, in order to query a data source. A schema, as used herein, may refer to a structure that represents a logical view of an entire data store. It defines how data is organized and how the relations among the data are associated. A data model, as used herein, may refer to fundamental entities to introduce abstraction in a database management system. Data models can define how data items are connected to each other and how they are processed and stored inside a system. ETL, as used herein, refers to a process for pulling data out of source systems and placing it into a data warehouse or any other system.

In some embodiments, the platform supports a full range of structured query language (SQL) commands. SQL is a domain-specific language often used to interact with a database management system. In this context, SQL commands can be used to specify the desired query. Examples of SQL commands include complex joins, correlated sub-queries, and window functions.

In some embodiments, the platform is aware of data sources and their native capabilities such that it can employ native query processes. For example, the platform can push down a free-text search to ELASTICSEARCH because it knows that this particular data source supports free-text searches. The platform may enable tools such as TABLEAU, EXCEL, and R to query data in the data sources.

The platform may have broad data preparation capabilities. For example, the platform can perform real-time data preparation using live data or virtual data. The platform can also include a virtual dataset editor that enables an end-user to prepare virtual datasets. Unlike existing systems, the prepared data of the platform lives in virtual datasets such that physical copies of the datasets are not required. The disclosed platform can also perform analysis-informed preparation. For example, the platform can switch back-and-forth between TABLEAU and preparation processes. Further, the platform can recommend actions based on user behavior. For example, machine learning can be used to learn from users that use the platform.

The platform may offer enterprise-grade security and governance capabilities with consumer-grade ease-of-use. This includes versatile and intuitive access control mechanisms. For example, a user can decide who can access what data at granular levels such as data row or data column levels. The user can even hide some data from users or groups. The platform may also maintain lineage capabilities. That is, datasets are connected and a user can browse ancestors and descendants of each dataset and column. The platform may also have auditing capabilities that allow a user to monitor who is accessing data and identify a time when the data was accessed. In some embodiments, the platform can generate real-time reports showing, for example, the top 10 users of a given dataset or the users accessing datasets during off-hours.

The platform may offer improved performance and scalability. The platform can allow users to interact with data of diverse types and of any size, and from diverse data sources. The platform may accelerate query executions by using optimized data structures, also referred to herein as reflections, which can reside in memory or on persistent storage. As a result, the platform can provide orders of magnitude query acceleration and production system isolation compared to existing systems. The platform can also perform columnar in-memory analytics including columnar execution, byte-code rewriting, and runtime compilation. In some embodiments, such analytics are implemented in APACHE ARROW.

The platform may support numerous computing devices. For example, the platform may use a server cluster that can scale to thousands of servers and run on-premise and/or in the cloud. The platform may integrate any number of distributed data stores of data sources. The platform understands the distribution of data and has capabilities to query each data source. This maximizes push-downs (e.g., RDBMS, MONGODB, ELASTICSEARCH) and allows for reading data in parallel from distributed data stores.

Management and Preparation

The platform may be configured to connect to diverse data sources. For example, a user can input connection information for each data source. Examples of connection information include an IP address or domain name and credentials that enable access to data contained in the data source. The platform can then use the connection information to connect to the data source and run queries on any datasets of the data source. Any query may include multiple datasets and data sources (e.g., through joins).

The platform enables users to discover data, curate data, accelerate queries, and share data of the data sources with other users. The platform can include a unified data catalog for users to discover and explore physical or virtual datasets, data sources, and their relationships to each other. As used herein, a “physical dataset” may refer to raw data contained in data sources connected to the platform. The platform enables end-users to interact with the physical datasets contained in these data sources. A physical dataset may belong to a namespace hierarchy exposed by a data source. Examples include relational tables, MONGODB collections, files or directories of files, or ELASTICSEARCH indexes or types. For example, a MONGODB data source can have a simple hierarchy such as <cluster>.<database>.<collection>. An AMAZON S3 data source can have an arbitrarily complex hierarchy such as <bucket>.<path>.<to>.<directory>.

The platform enables users to curate data by creating virtual datasets. As used herein, a “virtual dataset” refers to a dataset defined by a user of the platform. A virtual dataset may be derived from physical dataset(s) or other virtual dataset(s). The platform does not need to save the actual data (e.g., content) of a virtual dataset. Instead, the platform only needs to save the definition of a virtual dataset (e.g., a “SELECT” statement in SQL analogous to a database view).
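By way of a non-limiting illustration, the database-view analogy can be sketched with Python's built-in sqlite3 module. The table name orders, the view name east_orders, and the sample rows are hypothetical and are not part of the disclosed platform; the sketch only shows that a stored SELECT definition, rather than a copy of the data, is sufficient to serve a derived dataset.

```python
import sqlite3

# Hypothetical in-memory database standing in for a connected data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 5.0)],
)

# The "virtual dataset": only this SELECT definition is stored, no rows are copied.
conn.execute(
    "CREATE VIEW east_orders AS "
    "SELECT id, amount FROM orders WHERE region = 'east'"
)

# A later change to the physical dataset is visible through the view
# because the view is evaluated against the underlying table at query time.
conn.execute("INSERT INTO orders VALUES (4, 'east', 7.5)")
east_rows = conn.execute(
    "SELECT id, amount FROM east_orders ORDER BY id"
).fetchall()
print(east_rows)
```

Because the view holds no data of its own, the row inserted after the view was created still appears in the result.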

Accordingly, an end-user only needs to be concerned with datasets, both physical and virtual. The platform may support a variety of point-and-click transformations, and users can utilize SQL syntax (or another supported language) to define more complex transformations. As queries are executed, the platform can learn about the data, enabling it to recommend various transformations such as joins and data type conversions. The data catalog can be automatically updated when data sources are newly added and as data sources or datasets change. All metadata may be indexed in a high-performance, searchable index, and exposed to users through the platform's portal interface. For example, users can browse a data graph to understand relationships between datasets and monitor what users are doing with a particular dataset. A user can explore and analyze data regardless of location and size, with minimal or no upfront work.

The platform can accelerate query execution by several orders of magnitude compared to directly querying datasets contained in data sources. For example, the platform can create optimized data structures (i.e., reflections) based on physical or virtual datasets. The optimized data structures can reside in memory or on persistent storage referred to as an autonomous memory or reflections data store. The optimized data structures can be used in lieu of directly querying data sources. An optimized data structure can be created autonomously by the platform, manually by a user of the platform, or a combination of both. That is, users can manually designate the datasets to accelerate and/or the system may decide which optimized data structures to create autonomously based on, for example, past queries and workloads for processing queries. In one example, users can vote for datasets that they think should be accessed faster, and the platform's cache heuristics can consider these votes in deciding which optimized data structures to create.
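As one hypothetical illustration of combining query history with user votes, the following Python sketch ranks candidate datasets by a simple additive score. The scoring rule, the vote_weight parameter, the budget parameter, and the function name are assumptions made for illustration only; the disclosure does not specify a particular heuristic.

```python
from collections import Counter

def choose_datasets_to_accelerate(query_history, votes, budget, vote_weight=1):
    """Rank datasets by past query count plus weighted user votes.

    Assumed scoring: score = number of past queries + vote_weight * votes.
    Returns up to `budget` dataset names, highest score first.
    """
    scores = Counter(query_history)
    for dataset, n_votes in votes.items():
        scores[dataset] += vote_weight * n_votes
    # Sort by descending score, breaking ties by name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [dataset for dataset, _ in ranked[:budget]]

# Hypothetical inputs: one entry per past query, and votes per dataset.
history = ["orders", "orders", "customers", "orders", "events"]
votes = {"events": 3}
chosen = choose_datasets_to_accelerate(history, votes, budget=2)
print(chosen)
```

Here the votes lift a rarely queried dataset above a frequently queried one, which is the kind of user influence the passage describes.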

In some embodiments, an optimized data structure is anchored to at least one physical or virtual dataset. At query time, using the optimized data structure can accelerate querying meant for any underlying source datasets. In some embodiments, the optimized data structure is based on APACHE PARQUET or ORC, with a variety of surrounding optimizations such as column-level statistics. The optimized data structure can be based on data (e.g., data columns) sorted, partitioned, and distributed by specific columns.

The optimized data structures are objects, materializations, data fragments, or the like, stored in the autonomous memory of the platform. The memory is referred to as “autonomous” because the platform can autonomously decide to generate optimized data structures that are stored in the autonomous memory for use to accelerate queries. When seeking to query a data source, the end-user does not need to consider any optimized data structures or know of their existence. Rather, the use of optimized data structures by the platform to accelerate a query is transparent to users. For example, when a query is received from a BI tool, an “optimizer” of the platform determines an optimal query execution plan (“query plan”), which may include pushing sub-queries down into the data sources and/or utilizing suitable optimized data structures.

An optimized data structure may contain data of any type or size. The platform knows the definition (i.e., logical plan) of the optimized data structure, which allows the platform to refresh the data of the optimized data structure and to determine, at query time, whether that optimized data structure can accelerate the computation of the query results. For example, when responding to a received query, the platform typically must perform a substantial amount of computational work. The query results that satisfy the query do not necessarily live in a data source or in an optimized data structure. Instead, for example, the raw data can live in the data sources as physical datasets. In a non-accelerated case, the computation starts with raw data and computes the query results. The optimizer of the platform can identify an opportunity to leverage an optimized data structure when there is a way to compute the query results based on the optimized data structure. In some embodiments, the optimizer may return an approximate query result within some user allowable tolerance when optimized data structures are available for such approximation rather than obtaining exact results.

The platform may decide whether to generate (i.e., create) optimized data structures autonomously, based on user input, or combinations thereof. For example, in order to facilitate management, the platform can have each optimized data structure anchored to a specific dataset (physical or virtual). This facilitates the ability of an administrator to understand what an optimized data structure contains and facilitates identifying queries on a dataset that are executing too slowly such that a user can request creation of an optimized data structure anchored to that highly queried dataset.

The disclosed embodiments may include different types of optimized data structures. For example, an optimized data structure anchored to a single dataset could be a raw reflection or an aggregation reflection. A raw reflection has all the records in a dataset but perhaps only some of its data columns, sorted, partitioned and distributed by specific columns. An aggregation reflection has aggregation (i.e., summarization) of a dataset similar to an OLAP cube with dimensions and measures. An aggregation reflection can be used to accelerate aggregation queries. Another type of optimized data structure is a sample reflection that has samples of data from the dataset. A sample reflection can be used by the platform to accelerate queries by many orders of magnitude if a user allows for approximate query results (e.g., within 0.1% statistical error) based on sampled data.
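The difference between a raw reflection and an aggregation reflection can be illustrated with a minimal Python sketch over in-memory rows. The dataset contents, column positions, and function names below are hypothetical; an actual embodiment may store reflections in a columnar format rather than in Python lists and dictionaries.

```python
# Hypothetical dataset rows: (region, product, amount).
DATASET = [
    ("east", "a", 10.0),
    ("west", "b", 25.0),
    ("east", "b", 5.0),
    ("west", "a", 15.0),
]

def build_raw_reflection(rows, columns, sort_by):
    """Keep every record but only the chosen columns, sorted by one of them."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return sorted(projected, key=lambda r: r[columns.index(sort_by)])

def build_aggregation_reflection(rows, dimension, measure):
    """Pre-compute count, sum, min and max of a measure per dimension value."""
    agg = {}
    for row in rows:
        key, value = row[dimension], row[measure]
        if key not in agg:
            agg[key] = {"count": 0, "sum": 0.0, "min": value, "max": value}
        entry = agg[key]
        entry["count"] += 1
        entry["sum"] += value
        entry["min"] = min(entry["min"], value)
        entry["max"] = max(entry["max"], value)
    return agg

# Raw reflection: columns 0 (region) and 2 (amount), sorted by amount.
raw = build_raw_reflection(DATASET, columns=[0, 2], sort_by=2)
# Aggregation reflection: region is the dimension, amount is the measure.
agg = build_aggregation_reflection(DATASET, dimension=0, measure=2)
print(raw)
print(agg["east"]["sum"], agg["west"]["max"])
```

An aggregation query such as a sum of amount by region can then be answered from agg alone, without rescanning the dataset rows.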

The platform determines whether to leverage optimized data structures when defining or modifying a query plan based on a received query. For example, the platform can compile a received SQL query from a client device to define a query plan. The query plan describes how the query will execute including all operations needed in order to compute the query results. When the platform determines that one or more optimized data structures can be used to accelerate a query, the platform may generate the query plan or modify a generated query plan to utilize the optimized data structure(s) rather than directly query data sources.
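A simplified sketch of such a plan modification is given below in Python. It assumes, purely for illustration, that a query plan is a flat list of scan operations and that a reflection can substitute for a scan when it is anchored to the scanned dataset and materializes every column the scan needs. The Scan and Reflection types, the rewrite function, and all dataset names are hypothetical; an actual optimizer would operate on a relational-algebra plan.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    source: str          # dataset or reflection to read
    columns: frozenset   # columns the plan needs from that source

@dataclass(frozen=True)
class Reflection:
    name: str
    anchor: str          # dataset the reflection was derived from
    columns: frozenset   # columns the reflection materializes

def rewrite(plan, reflections):
    """Replace each scan with a covering reflection when one exists."""
    rewritten = []
    for scan in plan:
        match = next(
            (r for r in reflections
             if r.anchor == scan.source and scan.columns <= r.columns),
            None,
        )
        # Scans without a covering reflection still read the data source.
        rewritten.append(Scan(match.name, scan.columns) if match else scan)
    return rewritten

plan = [
    Scan("sales.orders", frozenset({"region", "amount"})),
    Scan("crm.customers", frozenset({"id", "name"})),
]
reflections = [
    Reflection("refl_orders_raw", "sales.orders",
               frozenset({"id", "region", "amount"})),
]
modified_plan = rewrite(plan, reflections)
print([scan.source for scan in modified_plan])
```

In this sketch only the first scan is redirected, so the modified plan reads the reflection in lieu of some, but not all, of the data in the data sources.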

The platform may enable users to securely share data (e.g., virtual datasets or query results) with other users and groups. For example, a group of users can collaborate on a virtual dataset that will be used for a particular analytical job. Alternatively, users can upload their own data, such as EXCEL spreadsheets, to join to other datasets. In some embodiments, users that created virtual datasets can determine which other users can query or edit those virtual datasets.

FIG. 4 is a block diagram illustrating relationships 400 between queries 402, virtual datasets 404, and physical datasets 406 according to some embodiments of the present disclosure. A physical dataset cannot have a parent dataset but can be the parent of a virtual dataset. A virtual dataset could also be the parent or child of another virtual dataset. The platform knows the relationships between any datasets, and can use that knowledge to determine what optimized data structures to create and maintain. Hence, the relationships can be used to accelerate a query execution process.

The arrows of FIG. 4 show how the queries 402 can be processed from a combination of virtual datasets 404 and/or physical datasets 406. The virtual dataset 404-1 is derived from a combination of the physical dataset 406-1 and the virtual dataset 404-2, and the virtual dataset 404-2 is derived from the physical datasets 406-2 and 406-3. As such, the query 402-1 can be satisfied from the virtual dataset 404-1. The query 402-2 can be satisfied from the virtual dataset 404-1 and the physical dataset 406-3. Lastly, the query 402-3 can be satisfied from the virtual dataset 404-2.
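The relationships described for FIG. 4 can be encoded as a small parent map, as in the following Python sketch, which resolves each query to the physical datasets it ultimately depends on. The dictionary layout and the function name are illustrative assumptions; only the reference numerals and their relationships come from the description above.

```python
# Parent relationships as described for FIG. 4 (virtual dataset -> parents).
PARENTS = {
    "404-1": ["406-1", "404-2"],
    "404-2": ["406-2", "406-3"],
}

# Datasets each query reads directly, as described for FIG. 4.
QUERIES = {
    "402-1": ["404-1"],
    "402-2": ["404-1", "406-3"],
    "402-3": ["404-2"],
}

def physical_ancestors(dataset):
    """Return the physical datasets a dataset is ultimately derived from."""
    parents = PARENTS.get(dataset)
    if parents is None:
        # A physical dataset has no parent, so it resolves to itself.
        return {dataset}
    result = set()
    for parent in parents:
        result |= physical_ancestors(parent)
    return result

query_sources = {
    query: set().union(*(physical_ancestors(d) for d in datasets))
    for query, datasets in QUERIES.items()
}
print(sorted(query_sources["402-3"]))
```

Knowing, for example, that queries 402-1 and 402-2 both resolve to the same three physical datasets is the kind of relationship information that could inform which optimized data structures to create.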

An application running on a client device may issue a query to the platform over ODBC, JDBC, REST or other APIs. The query may include one or more datasets residing in different data sources. For example, a query may be a join between a HIVE table, ELASTICSEARCH index, and several ORACLE tables. A query for a dataset will often be accelerated by using an optimized data structure anchored to that dataset. As indicated above, the optimized data structures may involve a raw reflection or an aggregation reflection. The raw reflection can include a projection of one or more columns of a dataset. The data may be sorted, partitioned, or distributed by different columns of a dataset. The aggregation reflection may include an aggregation of columns of a dataset. The aggregate dataset is defined by dimensions and measures, and contains aggregate-level data for each of the measures such as count, sum, min and max. The data may be sorted, partitioned, and distributed by different columns of a dataset.

As indicated above, although the platform can autonomously decide andautomatically generate an optimized data structure, there may becircumstances in which a user desires to create a custom optimized datastructure. In such instances, the platform allows the user to simplycreate a new optimized data structure with an SQL query that defines adesired materialization such as, for example, create a single rawreflection that includes all columns of a specific dataset.

The content of optimized data structures may be refreshed to update data or remove stale data. The content may be refreshed manually or automatically to ensure that the most current data is available for queries. The optimized data structures may be refreshed in accordance with full or incremental refresh processes.

The platform may maintain a directed acyclic graph (DAG) that defines an order in which optimized data structures should be refreshed. The dependencies can be calculated from relational algebra, and the actual refresh start time can take into account the expected amount of time required to complete a refresh cycle. This approach reduces the end-to-end cycle time, as well as the compute resources required to complete the cycle. In addition, by leveraging one optimized data structure to refresh another optimized data structure, the platform can avoid accessing operational databases more than once in a refresh cycle.
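One way to derive a refresh order from such a DAG is a topological sort. The sketch below uses the Python standard library class graphlib.TopologicalSorter (Python 3.9 or later); the dataset names and the dependency dictionary are hypothetical, and the disclosure does not specify this particular algorithm.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key is refreshed from the entries in its set.
depends_on = {
    "raw_orders": set(),                  # refreshed from the operational database
    "orders_by_day": {"raw_orders"},      # refreshed from raw_orders
    "orders_by_month": {"orders_by_day"}, # refreshed from orders_by_day
}

# static_order() yields every node after all of its dependencies.
refresh_order = list(TopologicalSorter(depends_on).static_order())
```

Here only raw_orders reads the operational database; the two aggregates are refreshed from other optimized data structures, so the database is accessed once per cycle.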

In some embodiments, a user can indicate a relative staleness of data that is permitted for use in optimized data structures. The indication can be a threshold value that limits relevant query results data. Accordingly, the platform can automatically determine when to refresh each optimized data structure in the autonomous memory based on the threshold. For example, a user can indicate via the UI that relevant query results can be at most 8 hours old.
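The staleness threshold can be pictured as a simple age check. The sketch below is a minimal illustration; the function name needs_refresh and the timestamps are assumptions, and a real scheduler would also account for the expected refresh duration.

```python
from datetime import datetime, timedelta

def needs_refresh(last_refreshed, now, max_staleness):
    """Return True when the data structure is older than the permitted staleness."""
    return now - last_refreshed > max_staleness

# The 8-hour limit from the example above; the timestamps are hypothetical.
max_staleness = timedelta(hours=8)
now = datetime(2016, 6, 24, 12, 0)
```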

The platform may take into account relationships to determine an optimal order in which optimized data structures should be refreshed. For example, an optimized data structure X could be refreshed prior to an optimized data structure Y if the optimized data structure Y is derived from the optimized data structure X. In addition, the platform may allow a user to restrict the total number, rate, and period for refreshing optimized data structures. For example, a refresh operation may be set to a specific time window (e.g., night time only). In some embodiments, a user can specify a schedule for refreshing optimized data structures. In some embodiments, the platform may continuously maintain the autonomous memory up-to-date based on any changes made to a data source. This can be done by consuming a database log, monitoring a directory for new files, or running queries on the data source that return the new and/or updated records.

The platform may use multiple techniques to reduce time and resources required to obtain query results. For example, the platform may consider the capabilities of a particular data source and a relative computational expense for querying the particular data source. In some embodiments, the platform can define a query plan that executes stages of the query execution at the data source or the platform's distributed execution environment to achieve the most efficient execution. In another example, the platform can accelerate a query execution by using available optimized data structures in the autonomous memory for portions of the query when this produces the most efficient overall query plan. In many cases, execution of a query plan can be orders of magnitude more efficient when only querying optimized data structures rather than querying any underlying data sources.

The platform may be able to push down processing into relational and non-relational data sources. Non-relational data sources typically do not support SQL and have limited execution capabilities. For example, a file system cannot apply predicates or aggregations. On the other hand, MONGODB can apply predicates and aggregations but does not support all joins. The optimizer considers capabilities of each data source and, as such, the platform will push as much of a query to the underlying source as possible when it is most efficient and perform the rest in its own distributed execution engine.
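The capability-based split between pushed-down and locally executed operators can be sketched as follows. The capability table and the function split_plan are hypothetical and simplified: the sketch considers only whether an operator is supported, whereas the optimizer described above also weighs efficiency.

```python
# Assumed capability sets, loosely following the examples in the text.
CAPABILITIES = {
    "filesystem": set(),
    "mongodb": {"filter", "aggregate"},
    "oracle": {"filter", "aggregate", "join"},
}

def split_plan(source, operators):
    """Push down the leading operators the source supports; run the rest locally."""
    supported = CAPABILITIES[source]
    pushed = []
    for op in operators:
        if op not in supported:
            break  # everything from here on runs in the platform's engine
        pushed.append(op)
    return pushed, operators[len(pushed):]
```

For a plan of filter, aggregate, and join against the assumed "mongodb" entry, the filter and aggregate are pushed down and the join runs in the platform.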

The platform may offload and protect operational databases. Most operational databases are designed for write-optimized workloads. Furthermore, deployments must address stringent service level agreements (SLAs), and any downtime or degraded performance can significantly impact the business. As a result, operational systems are frequently isolated from processing analytical queries. In these cases, the platform can execute analytical queries by using optimized data structures, which provide the most efficient query processing possible while minimizing impact on the operational system.

Embodiments include a portal for a user to interact with the platform. The portal may be a network portal including a user interface (UI) that can facilitate data management and preparation operations by users. For example, a user can create a virtual dataset by using a visual dataset editor view of the portal. An example of a portal is a network portal such as a web browser displaying a graphical UI (GUI) including graphical controls for users to submit queries, access datasets, prepare virtual datasets, receive query results, and the like. For example, a GUI can include clickable links to physical datasets and buttons to initiate creation of a virtual dataset based on other datasets.

Accordingly, the portal may enable users to manipulate a dataset. In addition, the portal may display an interactive data graph similar to that shown in FIG. 4. For example, each node in the data graph can represent a dataset. A virtual dataset can have both incoming and outgoing edges, and a physical dataset would only have outgoing edges because any physical dataset is in a data source.

In some embodiments, a user can select a node from the data graph to edit the underlying dataset. For example, the user can change a column's data type, flatten a nested structure, rename a column, or extract a portion of a column into a new column. In addition, a user can select other datasets to combine with a current dataset. Each such transformation updates the definition for the transformed virtual dataset. In some embodiments, the definition of a virtual dataset can be expressed as a SQL “SELECT” statement, and the user can edit the definition directly. Once the user is satisfied with the resulting dataset, the user can save the virtual dataset by specifying a name and location in a hierarchical namespace for that dataset. From that point on, other virtual datasets can be derived from the named virtual dataset.

In some embodiments, a user can also upload files to the platform in addition to physical datasets residing in data sources such as databases and file systems. For example, a user can upload an EXCEL spreadsheet, which is then stored and exposed as a dataset of the platform. For example, assume that the platform is connected to a NOSQL database or HADOOP cluster with an extremely large dataset. The user may want to replace specific values in one column (e.g., to solve a data quality issue). The user could upload an EXCEL spreadsheet with two columns including old values and new values, respectively, and then create a virtual dataset as a join between the large dataset and the EXCEL spreadsheet.

Thus, management and preparation features of the platform are designed to enable self-service by a user to create and modify virtual datasets and/or cause creation of optimized data structures without needing specialized technical knowledge or skills and without needing to define schemas. In some embodiments, a user can simply interact with data in a spreadsheet-like interface. In addition, multiple users may collaborate by building on one another's virtual datasets.

Integration of Data Sources

The platform may harmonize query execution operations across diverse data sources including local and remote data sources. Further, the platform can access data distributed across multiple data sources including relational and non-relational data sources. For example, the platform can retrieve data from different data sources and combine the data to produce final query results that satisfy a query.

The platform may have a scale-out architecture. It can scale from one server to thousands of servers in a single cluster. The platform may be deployed on dedicated hardware or on shared infrastructure such as HADOOP clusters, private clouds or public clouds. For example, the platform can be deployed on a HADOOP cluster when using the platform to analyze data in HADOOP. This enables the platform to achieve data locality for raw data and the optimized data structures contained in the autonomous memory.

The platform cluster has two distinct roles: coordinators and executors. Each role can be scaled independently. Coordinators are nodes responsible for coordinating query execution, managing metadata and serving the portal. Client applications, such as BI tools, connect to and communicate with coordinators. Coordinators can be scaled up to process more clients concurrently. Executors are nodes responsible for query execution. Client applications do not connect to executors. Executors can be scaled up to process larger data volumes and more concurrent queries.

When running the platform on HADOOP, the coordinators could be deployed on edge nodes so that external applications such as BI tools can connect to them. Furthermore, there is no need to manually deploy the platform on the HADOOP cluster because the coordinators can use YARN to provision the compute capacity on the cluster. To maximize performance, every node in the cluster may have an executor.

FIG. 5 is a flowchart illustrating processes of the platform according to some embodiments of the present disclosure. In step 502, the platform is connected to one or more data sources. In step 504, a user device displays a self-service portal of the platform on a display of the user device. For example, the user may open a web portal administered by the platform that enables the user to query data sources connected to the platform in step 502. In step 506, a user may use a visual dataset editor of the portal to create a virtual dataset and/or share the virtual dataset with other users. The process for creating and sharing virtual datasets is optional and described elsewhere in this disclosure.

In step 503, the platform may determine or receive statistics or capabilities information of data sources. The statistics or capabilities can be used to formulate an optimal search plan for executing a query as detailed below. In some embodiments, the statistics or capabilities information can be obtained after receiving the query and stored in a local memory. As such, the platform can retrieve the statistics and capabilities information from the local memory when needed to define a query plan.

In step 508, the platform generates an optimized data structure. The decision to generate the optimized data structure may be autonomous and the process for generating the optimized data structure can be automatic. In some embodiments, the decision to generate the optimized data structure may be based on user input. In some embodiments, the decision and/or process for generating the optimized data structure can be based on a combination of autonomous, manual, or automatic steps.

In step 510, the platform receives a query from the user device. The query may refer to physical datasets and/or virtual datasets. The query may be received from a client device over a network coupled to a coordinator of the platform. In some embodiments, the query is received via ODBC, JDBC, or REST.

In step 512, the platform executes a process to define a query plan based on the received query. For example, the query plan can define the operators that make up the computation for the query.

A supported operator may refer to an operator that a data source is capable of supporting. For example, NOSQL databases and query engines only support a subset of operators required to implement complex queries. For example, MONGODB can perform aggregations but not joins. In some cases, capabilities of a data source depend on how data is organized by the data source. For example, ELASTICSEARCH cannot perform an equality filter on a field that is tokenized.

A code generation refers to code that a data source is capable of generating. For example, the disclosed platform can leverage extensibility mechanisms of NOSQL databases. When interacting with an ELASTICSEARCH cluster, portions of the query may be compiled into GROOVY or PAINLESS scripts that are injected into the ELASTICSEARCH query. When interacting with a MONGODB cluster, portions of the query may be compiled into JavaScript code that runs in MONGODB's MAPREDUCE framework.

An execution speed may refer to a speed at which data sources can execute operators. That is, some data sources may execute certain operators faster than others. For example, ELASTICSEARCH projections are relatively slow such that it is usually better to pull an entire record to perform a projection. As such, the query plan may execute a portion of the query on a data source that operates faster than others.

A data distribution may refer to how data is distributed across data sources. That is, the query plan can be configured to take advantage of how data is organized in a data source. For example, data contained in the data source may be organized in such a way that querying that data source reduces the overall amount of time required to obtain query results included on that data source. For example, ELASTICSEARCH parent-child relationships can collocate matching records from different datasets such that a join between the two datasets does not require any data shuffling. As such, a query plan may prefer to execute a portion of a query on a data source that reduces query time by avoiding shuffling over a data source that does not avoid shuffling.

A network throughput and latency consideration refers to whether network throughput and latency affect the process of obtaining query results. For example, a query plan may give preference to performing query operations locally at the data sources to a greater extent. For example, if a slow network exists to a particular data source, the query plan can be configured to push down operations to the data source and transfer the query results after executing the pushed-down operations.

For example, the platform may normally be able to aggregate data faster locally compared to aggregating the data at a data source. Under normal network conditions, it would be preferable to receive data from the data source and aggregate the data locally at the platform. However, under slower network conditions, it is preferable to aggregate data at the data source and transfer the aggregated data over the network rather than transferring un-aggregated data for aggregation by the platform. On the other hand, if daemons of the platform are co-located with data sources, the cost of transferring data is relatively low such that the location where data is aggregated would not negatively affect query execution.
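The trade-off above can be made concrete with a toy cost comparison. The model below is an assumption introduced for illustration only (estimated time equals compute time plus bytes transferred divided by throughput); the function name aggregation_site and all parameters are hypothetical, and the disclosure does not define a specific cost formula.

```python
def aggregation_site(raw_bytes, agg_bytes, network_bytes_per_sec,
                     source_agg_secs, local_agg_secs):
    """Pick the site with the lower estimated compute-plus-transfer time."""
    # Aggregate at the source, then ship the (small) aggregated result.
    at_source = source_agg_secs + agg_bytes / network_bytes_per_sec
    # Ship the (large) raw data, then aggregate at the platform.
    at_platform = raw_bytes / network_bytes_per_sec + local_agg_secs
    return "source" if at_source < at_platform else "platform"
```

With a fast link the faster local aggregation wins; with a slow link the cost of moving un-aggregated data dominates and the aggregation is pushed to the source.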

A data source SLA consideration refers to constraints imposed by a SLA. For example, a user may want to minimize the load of queries on a database to avoid violating a SLA. In these instances, the platform can take the SLA into account when deciding what portion of a query should be applied to the database to avoid breaching its SLA.

In some embodiments, the query plan can be defined to carry out operations in a distributed mode. The portions of the query can be executed across a number of nodes and may be executed in a phased manner. As such, operations distributed across non-relational data sources can be parallelized in phases. For example, a query may include an aggregation operation. A query optimizer of the platform may determine that the aggregation can be applied to a MONGODB database. Rather than requesting the MONGODB database to perform the entire aggregation, the query plan may require each MONGODB node (i.e., MONGOD daemon) to perform a local aggregation on a single shard of data, and each of these local aggregations is returned to a potentially different thread in the platform cluster. This allows execution to continue in parallel in the cluster.
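The phased aggregation described above resembles a two-phase (partial, then final) aggregate. The sketch below computes an average this way over in-memory lists that stand in for shards; the function names and data are hypothetical, and no MONGODB API is used.

```python
def local_aggregate(shard):
    """Phase 1: each node reduces its own shard to a partial (count, sum)."""
    return len(shard), sum(shard)

def merge_partials(partials):
    """Phase 2: combine the partial results into a global average."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count

# Three hypothetical shards, each aggregated independently.
shards = [[4, 2], [5], [3, 3, 1]]
average = merge_partials([local_aggregate(shard) for shard in shards])
```

Because each partial result is small and independent, the first phase can run in parallel on every node and only the partial results cross the network.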

The platform may define the query plan based on the relational model and other considerations. For example, the query plan may be defined based on the collected information including the functional abilities of the data sources and other parameters described elsewhere in this disclosure. For example, a query plan may employ execution capabilities of a data source at a query planning phase, optimization phase, or query execution phase.

In step 516, the query plan is modified to utilize optimized data structure(s) in lieu of querying the data sources directly for query results. In some embodiments, the original process for defining the query plan may consider any available optimized data structures including their ordering, partitioning, and distribution. In other embodiments, the defined query plan is re-defined if optimized data structures that could accelerate query execution have been identified. In some embodiments, the query plan may be broadened, modified to use optimized data structures, and/or modified to use the performance statistics obtained in step 503.
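A highly simplified view of this substitution is a coverage check: a scan of a dataset is replaced by a scan of a raw reflection when the reflection contains every column the query needs. The record layout and the function choose_scan below are hypothetical; the matching described in this disclosure also considers ordering, partitioning, and distribution, which the sketch ignores.

```python
def choose_scan(required_columns, dataset, reflections):
    """Return the name of a reflection that covers the needed columns, else the dataset."""
    for reflection in reflections:
        covers = set(required_columns) <= set(reflection["columns"])
        if reflection["dataset"] == dataset and covers:
            return reflection["name"]
    return dataset  # no usable reflection, so read the data source directly

# Hypothetical raw reflection anchored to a "sales" dataset.
reflections = [{"name": "sales_raw_1", "dataset": "sales", "columns": ["city", "amount"]}]
```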

In step 518, execution of the modified query plan begins with executors obtaining data into buffers from data sources. The executors can read physical datasets contained in remote data sources and/or optimized data structures contained in the autonomous memory. The data is read while executing the modified query plan with a distributed query engine of the platform and/or the data sources.

In step 520, the platform obtains query results that satisfy the received query. Further, obtaining query results from multiple data sources can be performed in parallel and/or in phases. The data may be obtained from optimized data structures in an autonomous memory (e.g., PARQUET files) and/or the underlying datasets. When reading from a data source, the executor can submit native queries (e.g., MONGODB Query Language, ELASTICSEARCH Query DSL, MICROSOFT TRANSACT-SQL) as determined by the optimizer in the planning phase.

In some embodiments, intermediate query results obtained from data sources and/or the autonomous memory are combined to produce final query results that satisfy the query. For example, one executor can merge data from other executors to produce the final query results. The merged data can be streamed as final query results to a coordinator of the platform.

In some embodiments, the platform can use high-performance columnar storage and execution powered by APACHE ARROW (columnar in memory) and APACHE PARQUET (columnar on disk). APACHE ARROW is an open source project that enables columnar in-memory data processing and interchange. In some embodiments, the execution engine of the platform can use APACHE ARROW. The data in memory can be maintained in the ARROW format, and there could be an API that returns query results as ARROW memory buffers.

APACHE PARQUET is an open source project that enables columnar data storage. It has emerged as a common columnar format in HADOOP and cloud computing ecosystems. Unlike APACHE ARROW, which is optimized for in-memory storage and efficient processing in CPU, PARQUET is optimized for on-disk storage. For example, it utilizes encoding and compression schemes, such as dictionary and run-length encoding, to minimize footprint and I/O. The platform may include a high-performance PARQUET reader that reads PARQUET-formatted data from disk into ARROW-formatted data in memory. The PARQUET reader enables fast processing of raw data as well as reflections in a cache. Further, it includes capabilities such as intelligent predicate push-downs and page pruning, in-place operations without decompressing data, and zero memory copies.

In step 522, the client device receives the final query results from the platform. The final query results may be rendered as text or a visualization, persisted, or a calculation may be performed on the final query results (e.g., step 526). The user can also operate the portal connected to the platform to view, manipulate, and analyze the final query results.

In step 524, the platform can generate optimized data structures based on the final query results rather than the datasets of the data sources. The decision to generate the optimized data structure may be autonomous or based on user input. Further, the process for generating the optimized data structures may be automatic. The newly created data structures can be used to accelerate subsequent query executions based on subsequent received queries.

Although FIG. 5 illustrates a particular series of steps, the disclosure is not so limited. Instead, many of the steps described above can occur in different orders or be omitted. For example, the user device need not visualize query results as shown in step 526 and the platform need not generate any optimized data structures based on query results. Further, the platform may receive the performance statistics of step 503 at any time or never obtain these statistics at all.

Dynamic and Unknown Schemas

Embodiments of the platform can handle dynamic schemas and schema uncertainty. A dynamic schema refers to a schema for a data source that changes or a data source that contains mixed data types. Dynamic schemas are common in non-relational data stores. For example, a data source such as MONGODB may have a single column including values for different data types (e.g., varchar, map, integer).

The platform can handle mixed data types of a data source. For example, the platform can determine that a column or field has a mixed data type. If so, it can cause display of a visual indicator to a user and enable a user to address the situation by modifying the column to include only a single data type or by splitting the column into multiple columns with respective data types. The platform can then support BI tools and other SQL-based applications that require each column to have a single data type. Accordingly, the disclosed platform provides a way for users to prepare data for analysis with BI tools.

This “single data type” approach converts mixed data types into a single data type. The user may specify a desired data type, and the platform either deletes entries with values that cannot be converted into the specified data type or replaces those values with a specified value such as “NULL.” On the other hand, a “split by data type” approach splits a column with mixed data types into a column for each data type. For example, if the column foo has a mix of map and text, the column is split into a foo_map column and a foo_text column. The platform can use the value NULL in place of the missing values in each column.
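Both approaches can be sketched over a plain list of values, with Python's None standing in for NULL. The function names, the suffix table, and the use of Python types in place of the platform's data types are assumptions made for illustration.

```python
def to_single_type(values, cast):
    """'Single data type': convert every value, replacing failures with None (NULL)."""
    converted = []
    for value in values:
        try:
            converted.append(cast(value))
        except (TypeError, ValueError):
            converted.append(None)
    return converted

def split_by_type(name, values):
    """'Split by data type': one column per type, padded with None (NULL)."""
    suffix = {dict: "map", str: "text", int: "integer"}  # assumed type names
    columns = {}
    for i, value in enumerate(values):
        column = f"{name}_{suffix[type(value)]}"
        columns.setdefault(column, [None] * len(values))[i] = value
    return columns
```

For a column foo holding one map and one text value, split_by_type produces foo_map and foo_text columns, each with None where the other type appeared.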

Schema uncertainty refers to the situation where a schema of a data source is unknown to a system attempting to query that data source. For example, the schema may not be explicitly defined for a data source. As such, the platform has no way of knowing the structure of datasets contained in the data source prior to executing query operations on the data source. As a result, the platform must operate with schema uncertainty.

Existing systems may assume that all data has an explicitly defined schema, particularly in approaches to data virtualization. This assumption holds when dealing with relational data sources but not when dealing with non-relational data sources such as NOSQL databases and files. For such data sources, existing approaches rely either on an administrator to manually specify a schema before data can be queried, or on examining a sample of data stored in the data store to approximate a schema.

The platform can compensate for schema uncertainty with “schema learning” or “schema pass-through.” That is, the platform can handle the situation where a data source does not advertise a schema for use by the platform. For example, MONGODB collections and JSON files are effectively schema-free. Further, when querying a file directory (e.g., in S3, HDFS), even if each file has a defined schema, files may be added to or removed from the directory such that the overall schema may change.

In schema learning, the platform can automatically learn a schema for a dataset. Initially, the platform estimates a schema based on sample data from a data source. A sample may be incomplete when it does not reflect a schema of all data. Hence, there is no guarantee that the estimated schema is accurate or complete. As the platform executes queries, the current schema is updated when the query execution engine observes data that does not match the currently assumed schema. In some cases, the query execution may not be able to continue because the query was originally compiled based on the assumed schema. In these cases, the platform automatically recompiles the query and restarts the execution with a newly learned schema.
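The estimate, observe, and restart loop of schema learning can be sketched as follows. The schema here is simply a mapping from field name to the set of Python type names seen so far; the functions observe and run_query, and the restart counter standing in for recompilation, are hypothetical simplifications.

```python
def observe(schema, record):
    """Merge a record into the schema; return True if the schema had to change."""
    changed = False
    for field, value in record.items():
        type_name = type(value).__name__
        if type_name not in schema.setdefault(field, set()):
            schema[field].add(type_name)
            changed = True
    return changed

def run_query(records, sample_size=1):
    """Scan all records, restarting whenever data contradicts the assumed schema."""
    schema = {}
    for record in records[:sample_size]:  # initial estimate from a sample
        observe(schema, record)
    restarts = 0
    while True:
        for record in records:
            if observe(schema, record):   # schema was wrong or incomplete
                restarts += 1             # stands in for recompiling the query
                break
        else:
            return schema, restarts       # full scan matched the learned schema
```

The loop always terminates because each restart adds at least one new field or type to the schema, and the records contain only finitely many.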

In schema pass-through, the platform can propagate changes of a dataset with a dynamic schema. For example, a user may want to add a calculated field as a new virtual column to a MONGODB collection. In this case, the virtual dataset should include all the fields in the MONGODB collection as well as the additional calculated field. If new fields are added to the MONGODB collection, the virtual dataset could reflect additions. To enable schema pass-through, the platform can support a way to perform wildcard-selection of columns in a virtual dataset. For example, “SELECT *, stars+10 FROM mongo.yelp.business” adds one additional column to all (*) the columns in the parent dataset (mongo.yelp.business). In some cases, rather than adding a column, a user can transform one column while allowing any other column to pass through. For example, a query could be “SELECT * EXCEPT city, UPPERCASE(city) AS city FROM mongo.yelp.business”. The expression “* EXCEPT city” refers to all the columns except city. The query could also be: “SELECT UPPERCASE(city) AS city, . . . FROM mongo.yelp.business” where “ . . . ” represents all columns other than those with the same names as those explicitly mentioned. There are many other syntaxes that could be used to allow pass-through of columns that are not explicitly called out.
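The effect of wildcard selection on the output columns can be sketched as a late expansion against whatever columns the parent dataset currently has. The function expand_select and the column names are hypothetical; the sketch models only the resulting column list, not SQL parsing.

```python
def expand_select(parent_columns, computed, except_columns=()):
    """Expand '*' (minus any EXCEPT columns) and append the computed columns."""
    passed_through = [c for c in parent_columns if c not in except_columns]
    return passed_through + list(computed)

# The parent schema is read at expansion time, so new fields pass through.
before = expand_select(["name", "city", "stars"], ["stars_plus_10"])
after = expand_select(["name", "city", "stars", "zip"], ["stars_plus_10"])
```

Because the wildcard is expanded when the virtual dataset is read rather than when it is defined, the added zip field appears in the result without any change to the definition.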

Namespace and Path Structures

Embodiments of the platform include a hierarchical namespace for physical or virtual datasets. The path to a location of a dataset is indicated with a pathname including path components separated by dots or slashes. For example, production.website.clicks may refer to a MONGODB collection named “clicks” stored in a MONGODB database named “website” in a MONGODB cluster named “production.” The “production” can be defined by a user when establishing a connection to this data source, and the “website” and “clicks” can be obtained from the MONGODB cluster.

A path to a dataset for file-based data sources (e.g., HDFS and S3) may include a variable number of path components. A dataset in such a source may reflect a single file or a directory of similarly structured files (e.g., a directory of log files). For example, a path to the file or directory at /path/to/clicks in a HADOOP cluster could be datalake.path.to.clicks if the connection to the HADOOP cluster was established with the name datalake.
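The mapping in this example can be sketched in a few lines. The function name dataset_path is hypothetical, and the sketch assumes that path components contain no dots; a real implementation would need to quote or escape such components.

```python
def dataset_path(source_name, file_path):
    """Map a file-system path to a dot-separated dataset path under a source name."""
    components = [part for part in file_path.split("/") if part]
    return ".".join([source_name] + components)
```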

The platform can expose dataset information when connected to data sources. For example, the disclosed platform must expose metadata information such as schemas and tables to SQL-based clients such as TABLEAU, which requires knowing about all datasets that can be queried. For a database-style data source (e.g., ORACLE, MONGODB, or ELASTICSEARCH), elements of the database that are queryable are known. Examples of such elements that reflect queryable datasets include ORACLE tables, MONGODB collections and ELASTICSEARCH types, or the like, which can provide a fast way to retrieve the list.

Identifying queryable datasets in file-based data sources (e.g., HDFS, S3) can be challenging because not every file or directory is queryable. For example, image and video files in a file-based data source cannot be queried as a dataset. In addition, a rapid way to retrieve a list of all queryable files typically does not exist. As such, the platform would have to traverse the entire file system, which is computationally prohibitive because it might contain millions or billions of files.

To overcome these challenges, the platform can learn over time what data in a file system can be treated as queryable datasets. The platform is not aware of the files or directories in the data source because there is no record of the files or directories in the metadata store of the platform. Thus, all files or directories of the data source can be initially presumed as non-queryable datasets.

The platform can receive an external query, such as one submitted through a third-party client application, that references a non-queryable dataset file or directory. If the platform is able to read the data in that file or directory, it is thereafter considered a queryable dataset. This could include intelligent automatic identification of file formats based on file extensions (e.g., “.csv”) and data profiling. If the platform is unable to read the data in that file or directory, an error is returned to the client application, and the file or directory remains a non-queryable dataset.

If the platform receives a query through its own portal that references a non-queryable dataset file or directory, the platform can prompt the user to specify format options (e.g., line and field delimiters in the case of text files) and can then run the query. The platform may decide not to prompt the user if the file or directory is self-describing and there is no need for any format options. The file or directory is considered a queryable dataset once the user provides the format options. In some embodiments, the user can explicitly mark a file or directory as a queryable dataset through the platform's portal or API. In addition, specific interactions such as clicking on a file in the portal can also have the effect of making a non-queryable dataset into a queryable dataset. In some embodiments, the user can also convert a file or directory into a non-queryable dataset.

The platform can learn over time which files or directories of a data source represent queryable datasets. Any file or directory (including non-queryable datasets) can be queried via a SQL query at any time. However, only known queryable datasets are advertised in the metadata that is returned to external client applications. For example, a TABLEAU user is only able to select known queryable datasets in the TABLEAU UI, although the user is able to use TABLEAU's Custom SQL feature to query even a non-queryable dataset file or directory.
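The learning behavior described above can be sketched as a catalog that starts empty and promotes a path to a queryable dataset the first time a query against it succeeds. The class FileSourceCatalog is hypothetical, and readability is approximated by a file-extension check that stands in for the format detection and data profiling mentioned above.

```python
class FileSourceCatalog:
    """Tracks which paths of a file-based source are known queryable datasets."""

    def __init__(self, readable_extensions=(".csv", ".json", ".parquet")):
        self.readable_extensions = readable_extensions  # assumed format detection
        self.queryable = set()  # nothing is presumed queryable at first

    def query(self, path):
        """Any path may be queried; a successful read promotes it to queryable."""
        if not path.endswith(self.readable_extensions):
            raise ValueError(f"cannot read {path}")  # stays non-queryable
        self.queryable.add(path)
        return f"results for {path}"

    def advertised_datasets(self):
        """Only known queryable datasets appear in metadata sent to clients."""
        return sorted(self.queryable)
```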

Data Intelligence

Embodiments of the platform enable users to interact with diverse data types from diverse data sources with relative ease. A user can interact with any datasets or any data sources via a portal provided by the platform. For example, the user can readily transform datasets, join datasets with other datasets, and the like. Typically, only a limited number of users are sufficiently technically savvy to specify exactly what operations need to be performed. To improve ease of use, the platform can be designed to make suggestions to users. For example, if a user is viewing a dataset and clicks a join button, the platform can automatically recommend other datasets that can be joined with the viewed dataset, as well as the join conditions (e.g., a formula for how to perform the join).

As users interact through a portal or through external systems via interfaces such as JDBC and ODBC, the platform can use machine learning (ML) and artificial intelligence (AI) to identify query patterns. The ML/AI allows the platform to present users with better choices and to identify both ideal patterns and patterns to be avoided. Examples include join recommendations, common multistep preparation and transformation patterns, common value filtering, aggregation patterns, frequent calculated fields, and preexisting datasets that might include what is desired (i.e., “similar datasets”).

Security and Governance

Security and data governance are critical to any enterprise. However, the increasing complexity and demand for data often leads to data sprawl, resulting in significant security risks. The platform enables users to discover, curate, accelerate, and share data without having to export copies of data into ungoverned systems including disconnected spreadsheets, BI servers, and private databases. This reduces the risk of unauthorized data access as well as data theft. In particular, the platform provides a virtual security layer with security and governance capabilities including lineage, authentication, access control, auditing, and encryption.

The platform may maintain a record of lineage for every datasetincluding those personal to a user. As such, an administrator can easilydetermine how a dataset was created, transformed, joined, and shared, aswell as the full lineage of these steps between datasets. That is,datasets are connected and the administrator (or any other permitteduser) can browse ancestors and descendants of each dataset and column.

The platform may support multiple authentication modes. For example,user accounts can be managed inside the platform in an embedded mode. Incontrast, the platform can also connect to an existing LDAP-baseddirectory service such as Active Directory. The platform can rely on thedirectory service for verifying credentials and checking groupmembership in an LDAP/Active Directory mode.

The platform can also support granular access controls includingphysical dataset permissions that control which users and/or groups canquery a specific physical dataset, and virtual dataset permissions thatcontrol which users and/or groups can query a specific virtual dataset.The access controls may include column-level permissions used torestrict access to sensitive columns in a dataset by specific users. Thecolumn-level permissions can be set via SQL or the UI. The accesscontrols may also include row-level permissions that can be used torestrict access to a subset of the records in a dataset for specificusers. The row-level permissions can also be set via SQL or the UI.

A user can also set access controls at a column or row level by creatinga virtual dataset. For example, an owner of a database can create aconnection to the database. However, rather than allowing others toquery the physical datasets of that database, the owner can derivevirtual datasets that limit exposure of the physical datasets. Forexample, the owner may create a derived virtual dataset that includesonly a subset of columns or rows of a physical dataset. The owner candeny access to the physical dataset while enabling access to the derivedvirtual dataset. As such, the derived virtual dataset effectively limitsaccess to data of the physical dataset. In addition, a maskinginstruction could operate to mask specific columns or rows of a dataset.As such, exposure of data can be limited per user and at differentlevels of granularity.

The platform may enable users to collaborate and selectively share orgrant access to data. A user of a dataset (e.g., owner or administrator)can designate users or groups that can query that dataset. The platformcan transparently construct a complex SQL statement based on the accesscontrol setting such that data returned to different users may differdepending on the identities of the users. For example, a UI may enablethe owner of a physical or virtual dataset to set levels of access tothat dataset for different users or groups. For example, the user canspecify that users that belong to a marketing department group can onlysee records that meet a specific condition or only see a masked versionof a credit card column.
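The SQL construction described above can be illustrated with a minimal sketch. The `AccessPolicy` class, the `build_query` function, the `sales.orders` dataset, and the masking expression are all hypothetical names chosen for illustration; the disclosure does not specify how the statement is assembled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AccessPolicy:
    """Hypothetical per-group policy; not a structure named in the disclosure."""
    row_filter: Optional[str] = None  # SQL predicate limiting visible rows
    masked_columns: Dict[str, str] = field(default_factory=dict)  # column -> masking expression


def build_query(dataset: str, columns: List[str], policy: AccessPolicy) -> str:
    """Rewrite a simple projection so that it honors the caller's policy."""
    select_items = [
        f"{policy.masked_columns[col]} AS {col}" if col in policy.masked_columns else col
        for col in columns
    ]
    sql = f"SELECT {', '.join(select_items)} FROM {dataset}"
    if policy.row_filter:
        sql += f" WHERE {policy.row_filter}"
    return sql


# Assumed policy for a marketing group: restricted rows, masked credit card column.
marketing = AccessPolicy(
    row_filter="region = 'US'",
    masked_columns={"credit_card": "'XXXX'"},
)
print(build_query("sales.orders", ["order_id", "credit_card"], marketing))
```

Two users issuing the same logical query would thus receive different results depending on the policy attached to their identities or groups.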

The disclosed access control capabilities may include user impersonation to enable controlling which users can query a particular dataset. In particular, a data source or virtual dataset can be set up in owner mode or impersonation mode. In owner mode, the identity of the data source or dataset owner is used. In impersonation mode, the identity of the end user is used. In some embodiments, impersonation mode can support using the identity of the child dataset that is being queried (which, in turn, may be in owner or impersonation mode).

When establishing a data source in owner mode, access to a data sourcemay require the owner of the data source (i.e., whoever defines thatdata source) to provide master credentials. However, when establishing adata source in impersonation mode, the owner does not need to providemaster credentials to access the data source because the identity of theend-user is used to access the data source.

Although some data sources (e.g., HADOOP) allow impersonation withoutany credentials via a trust relationship, some data sources (e.g.,MONGODB, S3) require credentials. For sources that require credentials,there are two ways in which to obtain the required credentials toimpersonate a user to a data source.

In one embodiment, a required credential can be obtained and maintainedper session. If the platform determines that it needs to access a datasource with the identity of an end user, the platform may be able to usethe credentials with which the user authenticated to the platform whenlogging in through the platform's UI or connecting with a BI tool. Thesecredentials are maintained by the platform in the context of the user'ssession with the system, and do not need to persist long-term.

In another embodiment, the required credential can be obtained and maintained in a keychain. The password for a user to connect to a platform does not necessarily work for some data sources. For example, S3 uses randomly generated keys rather than passwords such that the user's password, as well as the username, does not work to access files in S3. In some embodiments, the platform maintains a multiuser keychain as a sparse matrix that holds the credentials for each <data source, user> tuple. When the platform seeks to query a data source with the identity of a specific user, it consults the keychain to see if the keychain contains credentials for that data source and user. If not, the platform can prompt the user to enter credentials for that data source.
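One way to picture the keychain is as a dictionary keyed by the <data source, user> tuple, which is sparse because only populated pairs occupy an entry. The sketch below is illustrative only: the `Keychain` class, the `credentials_for` helper, and the choice to store a credential after prompting are assumptions not stated in the disclosure.

```python
from typing import Callable, Dict, Optional, Tuple


class Keychain:
    """Hypothetical sparse store: only (data_source, user) pairs with credentials are kept."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], str] = {}

    def get(self, data_source: str, user: str) -> Optional[str]:
        return self._entries.get((data_source, user))

    def put(self, data_source: str, user: str, credential: str) -> None:
        self._entries[(data_source, user)] = credential


def credentials_for(keychain: Keychain, data_source: str, user: str,
                    prompt: Callable[[str, str], str]) -> str:
    """Consult the keychain first; fall back to prompting the user."""
    credential = keychain.get(data_source, user)
    if credential is None:
        credential = prompt(data_source, user)
        # Assumption: the entered credential is remembered for later queries.
        keychain.put(data_source, user, credential)
    return credential


prompts = []  # records each time the (simulated) user is asked for a credential


def fake_prompt(data_source: str, user: str) -> str:
    prompts.append((data_source, user))
    return "generated-key-for-" + user


chain = Keychain()
first = credentials_for(chain, "s3", "alice", fake_prompt)
second = credentials_for(chain, "s3", "alice", fake_prompt)
```

In this sketch the second lookup is served from the keychain, so the user is prompted only once per data source.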

The platform may have auditing capabilities. This allows a user tomonitor who is accessing particular data and identify the time when thedata was accessed. In some embodiments, the platform can generatereal-time reports showing, for example, the top 10 users of a givendataset, and off-hours access. The platform can track and record useractivity, including all query executions. This serves as a single viewthat shows who is accessing what data. For example, a Jobs section ofthe UI can provide details of all query executions, enabling ITprofessionals to monitor the system for suspicious activity and identifyinstances of unauthorized data access.
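The reports mentioned above could, for example, be derived from a log of query executions. The following sketch assumes a hypothetical log format (dictionaries with `user`, `dataset`, and `time` keys) and an assumed definition of business hours; neither is specified in the disclosure.

```python
from collections import Counter
from datetime import datetime


def top_users(audit_log, dataset, n=10):
    """Return the n most frequent users of a dataset as (user, query_count) pairs."""
    counts = Counter(e["user"] for e in audit_log if e["dataset"] == dataset)
    return counts.most_common(n)


def off_hours_access(audit_log, start_hour=8, end_hour=18):
    """Return entries outside [start_hour, end_hour); the hours are an assumption."""
    return [e for e in audit_log if not (start_hour <= e["time"].hour < end_hour)]


# Made-up sample entries for illustration.
audit_log = [
    {"user": "alice", "dataset": "sales.orders", "time": datetime(2016, 6, 24, 9, 30)},
    {"user": "alice", "dataset": "sales.orders", "time": datetime(2016, 6, 24, 23, 5)},
    {"user": "bob", "dataset": "sales.orders", "time": datetime(2016, 6, 24, 14, 0)},
    {"user": "bob", "dataset": "hr.salaries", "time": datetime(2016, 6, 24, 2, 15)},
]
```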

The platform may have encryption capabilities. For encryption on the wire, the platform can leverage both TLS (SSL) and KERBEROS. For each data source, the platform can support the standard wire-level encryption scheme of the source system. For example, when connecting to a secure HADOOP cluster, the platform can communicate securely with the HADOOP services via KERBEROS. For encryption at rest, the platform can leverage encryption capabilities of the autonomous memory (e.g., HDFS, AMAZON S3). When the autonomous memory is on direct attached storage (i.e., the local disks of the cluster), encryption can be provided via self-encrypting drives or encryption at the operating system level.

Autonomous Memory

As indicated above, the disclosed embodiments include an autonomousmemory (also referred to as a reflections data store) configured tomaterialize data of data sources. The materialized data may enable theplatform to produce query results from the data sources without needingto connect to the data sources. Thus, the reflections data storecontains the optimized data structures that are used at query time whenavailable rather than relying solely on the data sources.

In some embodiments, the reflections data store is a persistent cacheused to accelerate query executions. The cache can live on HDFS,MAPR-FS, cloud storage such as S3, or direct-attached storage (DAS). Thecache size can exceed that of physical memory. This architecture enablescaching more data at a lower cost, resulting in a much higher cache hitratio compared to traditional memory-only architectures.

FIG. 6 is a block diagram illustrating an acceleration system for the platform 602 according to some embodiments of the present disclosure. The acceleration system may include an autonomous memory 608 configured for use to accelerate a query execution process. In some embodiments, the platform 602 may include the autonomous memory 608 or may be separate from it. Likewise, the acceleration system can include the platform 602, can be included in the platform 602, or can be separate from the platform 602.

The platform 602 is communicatively coupled to one or more data sources604 and one or more user devices 606. An autonomous memory 608 iscommunicatively coupled to the data sources 604 and platform 602. Theautonomous memory 608 may include a combination or cluster of in-memorystorage, on-disk storage, a distributed file system (e.g., HDFS), a blobor object store (e.g., AMAZON S3), or a database. The autonomous memory608 contains optimized data structures 610.

The query results of a query are more rapidly obtained by using the autonomous memory 608 because its optimized data structures 610 are optimized for complex queries on datasets from data sources 604 such as relational and non-relational data sources. Further, the autonomous memory 608 can be local to the platform 602 to avoid network issues such as bottlenecks, latency, and the like. The process for obtaining query results is accelerated relative to applying the query exclusively to the data sources 604. In particular, upon receiving a query for data contained in the data sources 604, the autonomous memory 608 can be queried in lieu of querying the data sources 604 or in addition to querying the data sources 604. Hence, the process for obtaining query results is accelerated because the platform 602 can avoid querying remote data sources.

The query operations involving the autonomous memory 608 are transparentto users of the user devices 606 that submit queries. In contrast, knowntechniques for improving query performance (e.g., OLAP cubes,aggregation tables, and BI extracts) require a user to explicitlyconnect to optimized data. Accordingly, users of user devices 606 cansubmit queries to the platform 602, which can automatically andtransparently accelerate query execution by using any optimized datastructures 610 available in the autonomous memory 608.

FIG. 7A illustrates a view of a display 700 rendered on a display of a user device 606 according to some embodiments of the present disclosure. As shown, the display 700 has a dataset settings window 702 including controls (e.g., control 704) that allow a user of the user device 606 to manage settings of optimized data structures (e.g., reflections). As shown, the dataset settings window 702 is rendered by a browser 706 running on the user device 606 connected to the platform 602 over a network such as the Internet. The dataset settings window 702 shows raw reflections 708. A user of the user device 606 can view or modify the raw reflections 708 by using controls via the display 700, in which case the platform creates and stores the changed raw reflections 708 in the autonomous memory 608 for subsequent use during query time.

FIG. 7B illustrates another view of the display 700 rendered on thedisplay of the user device 606 according to some embodiments of thepresent disclosure. As shown, the display 700 has a dataset settingswindow 710 including controls (e.g., control 712) that allow a user ofthe user device 606 to manage settings of optimized data structures. Asshown, the dataset settings window 710 is rendered by the browser 706running on the user device 606 connected to the platform 602 over anetwork such as the Internet. The dataset settings window 710 showsaggregate reflections 714. A user of the user device 606 can view ormodify the aggregate reflections 714 by using the controls via thedisplay 700, in which case the platform creates and stores the changedaggregate reflections 714 in the autonomous memory 608 for subsequentuse during query time.

As indicated above, the platform can include an optimizer configured tooptimize a query plan defining the execution of a query. The optimizercan explore opportunities to utilize optimized data structures containedin the autonomous memory rather than processing the raw data in the datasources at query time. In some embodiments, the query results satisfyinga query can be obtained more rapidly from the autonomous memory ratherthan the data source. Further, the computational cost is reduced byutilizing autonomous memory in lieu of data sources. As such, theoptimizer may consider all the optimized data structures in theautonomous memory when a new query is received and automatically definea query plan to utilize optimized data structures when possible.

In some embodiments, the optimizer may include a two-phase algorithm. In a pruning phase, the optimizer disregards any optimized data structures in the autonomous memory that are irrelevant because their logical plans have no physical datasets in common with the query's logical plan. In other words, any optimized data structures that are not based on physical datasets within the scope of the query are excluded from the query plan. In a sub-graph matching phase, the optimizer uses a hierarchical graph algorithm to match sub-graphs of the query's logical plan with logical plans of any remaining optimized data structures.
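The pruning phase can be sketched as a set-intersection test. This is a simplified illustration under stated assumptions: each logical plan is reduced to the set of physical datasets it references, the function name `prune_reflections` and the dataset names are hypothetical, and the sub-graph matching phase is not shown.

```python
from typing import Dict, Set


def prune_reflections(query_datasets: Set[str],
                      reflections: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep only reflections sharing at least one physical dataset with the query.

    `reflections` maps a reflection name to the physical datasets referenced
    by its logical plan (a simplification of a full logical plan).
    """
    return {
        name: datasets
        for name, datasets in reflections.items()
        if datasets & query_datasets  # non-empty intersection means still relevant
    }


reflections = {
    "r_business": {"mongo.yelp.business"},
    "r_reviews_join": {"mongo.yelp.business", "s3.reviews"},
    "r_orders": {"oracle.sales.orders"},
}
candidates = prune_reflections({"mongo.yelp.business"}, reflections)
```

Only the surviving candidates would then be passed to the more expensive sub-graph matching phase.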

FIG. 8 is a flowchart illustrating a process 800 for accelerating queryexecution according to some embodiments of the present disclosure. Thequery execution can be accelerated by using optimized data structures(i.e., reflections) contained in an autonomous memory (i.e., reflectionsdata store). Hence, in step 802, data is materialized for the autonomousmemory as optimized data structure(s). For example, an algorithm candetermine the data to materialize in the autonomous memory and/or datacan be materialized based on user input that explicitly indicates whatto materialize. The process for deciding and generating materializeddata is described elsewhere in this disclosure.

In step 804, the platform receives a query from a client device. Forexample, the platform may receive a query referring to datasets of datasources, and/or virtual datasets derived from datasets of data sources.In step 806, the platform defines a query plan based on the receivedquery. The query plan may refer to physical and/or virtual datasets. Theplatform can expand the virtual datasets based on their definitions,resulting in a query plan that refers to physical datasets only. Inother words, the platform can recursively substitute virtual datasetswith their definitions.
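The recursive substitution of virtual datasets can be sketched as follows. The representation is an assumption made for illustration: each virtual dataset is mapped to the list of datasets its definition reads from, any name absent from that mapping is treated as physical, and the function name `expand` and the dataset names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def expand(dataset: str, virtual_defs: Dict[str, List[str]],
           _active: FrozenSet[str] = frozenset()) -> Set[str]:
    """Return the physical datasets that `dataset` ultimately depends on."""
    if dataset not in virtual_defs:
        return {dataset}  # not virtual, so treated as a physical dataset
    if dataset in _active:
        raise ValueError(f"cyclic virtual dataset definition: {dataset}")
    physical: Set[str] = set()
    for parent in virtual_defs[dataset]:
        # Substitute each referenced dataset with its own expansion.
        physical |= expand(parent, virtual_defs, _active | {dataset})
    return physical


virtual_defs = {
    "v_top_businesses": ["v_clean_business", "s3.reviews"],
    "v_clean_business": ["mongo.yelp.business"],
}
physical = expand("v_top_businesses", virtual_defs)
```

After expansion, the plan for `v_top_businesses` refers only to physical datasets, which is the form the acceleration check operates on.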

In step 808, the platform determines whether query execution can be accelerated by using optimized data structures of the autonomous memory. In some embodiments, the platform checks whether there are optimized data structures in the autonomous memory that have not expired and which can be used to partially or entirely satisfy the query. For example, assume that the autonomous memory contains an optimized data structure X which corresponds to ‘SELECT name, city FROM mongo.yelp.business’. If the platform executes a query ‘SELECT name FROM mongo.yelp.business’, the platform identifies the opportunity to use the optimized data structure X. A projection is trivial if the cache is columnar. Similarly, if the platform executes the query ‘SELECT name, COUNT(*) FROM mongo.yelp.business GROUP BY name’, the platform identifies the opportunity to use the optimized data structure X. In this instance, the platform could perform an aggregation (GROUP BY name) and count on X.
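The example with optimized data structure X can be mimicked in a few lines. This is a toy sketch, not the platform's matching logic: the dictionary layout, the helper names `covers`, `project`, and `count_by`, and the sample rows are all made up for illustration.

```python
from collections import Counter

# X stands in for the materialized result of
# SELECT name, city FROM mongo.yelp.business (sample rows are invented).
X = {
    "dataset": "mongo.yelp.business",
    "columns": ["name", "city"],
    "rows": [
        {"name": "Cafe A", "city": "SF"},
        {"name": "Cafe A", "city": "LA"},
        {"name": "Diner B", "city": "NY"},
    ],
}


def covers(reflection, dataset, columns):
    """True if every column the query needs is present in the reflection."""
    return reflection["dataset"] == dataset and set(columns) <= set(reflection["columns"])


def project(reflection, columns):
    """SELECT <columns> computed from the reflection instead of the source."""
    return [{c: row[c] for c in columns} for row in reflection["rows"]]


def count_by(reflection, column):
    """SELECT <column>, COUNT(*) ... GROUP BY <column> computed on the reflection."""
    return Counter(row[column] for row in reflection["rows"])


names = project(X, ["name"]) if covers(X, "mongo.yelp.business", ["name"]) else None
counts = count_by(X, "name")
```

A query that needs a column absent from X (for example, a hypothetical `stars` column) would not be covered and would fall back to the data source.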

In step 808, if the query execution can be accelerated, the platformdetermines whether the query results can be obtained entirely from theautonomous memory. In step 810, if the query can be computed entirelybased on reflections in the autonomous memory, the query plan ismodified to include optimized data structures of the autonomous memoryand the platform does not need to access the data sources. In step 812,the platform utilizes the autonomous memory to compute results thatsatisfy the query. In step 814, the results are returned to the clientdevice.

In step 810, if the query execution cannot be accelerated, the queryplan is defined to read data from the data sources to compute the queryresults. The query execution can still be improved without the benefitsof the autonomous memory. For example, the scope of the query plan caninclude a distributed execution to leverage columnar in-memoryprocessing and push-downs into the underlying data sources when seekingdata from RDBMS or NOSQL data sources.

In step 812, if the query execution can be accelerated, the query planis modified with selected optimized data structures to facilitate theacceleration. In step 814, the platform determines whether the modifiedquery plan only needs to read the selected optimized data structures toobtain query results or needs to read the data contained in the datasources as well.

In step 816, if query results can be obtained by reading data from theautonomous memory and data sources, the query plan is modified toinclude a combination of the optimized data structures contained in theautonomous memory and the datasets contained in the data sources. Thiscan also occur when the platform determines that the overallcomputational cost of the query is lessened by using the data sourcesuch as, for example, when the data source has an index. Hence, thequery execution is accelerated by reading the selected optimized datastructures in lieu of reading at least some data from the data sources.

The platform then queries the autonomous memory to obtain intermediatequery results and queries the data sources to obtain the remainingintermediate query results. In some embodiments, the intermediate queryresults from the data sources are obtained before obtaining theintermediate query results from the autonomous memory. In someembodiments, the intermediate query results can be obtained in parallelfrom the autonomous memory and the data sources.

In step 818, if the query results can be obtained exclusively by readingfrom the autonomous memory, the query plan is modified to include onlythe selected optimized data structures. Hence, the query execution isaccelerated by reading only the optimized data structures whileexecuting the query. Lastly, in step 820, the data (i.e., intermediatequery results) obtained by reading from the autonomous memory and/ordata sources is merged, finalized, and returned to the user as finalquery results that satisfy the received query.

The platform can intelligently decide when to generate optimized datastructures and what data of the data sources to materialize in theautonomous memory. For example, the platform can make the determinationbased on user input and/or an algorithm that considers various factorsor constraints. For example, the input to the algorithm can includeinformation related to historical queries, historical query patterns orsessions, real-time or current queries, explicit votes (e.g.,crowd-sourcing). The platform may also consider physical constraints(e.g., available memory, resource consumption, and desired queryruntime) and policy constraints set by an administrator. Severalconsiderations for materializing data follow.

The platform may automatically materialize frequently queried data ordata related to the frequently queried data as optimized data structuresin the autonomous memory. Specifically, the platform can determine whatdata and queries are more common than others. The platform canmaterialize the frequently queried data or data related to thefrequently queried data to improve query capabilities with rapid accesscompared to accessing the data sources directly.

As indicated above, the platform may also materialize data based on userhints. The user hints may be explicit instructions to accelerate queriesreferring to particular physical or virtual datasets. For example, auser can request to accelerate queries on a particular dataset bycreating optimized data structures anchored to that dataset. Optimizeddata structures in this case are derived from the particular datasetsuch that queries for that dataset can be applied to the autonomousmemory rather than the data source.

The platform may materialize data based on relationships between datasets. The platform knows the relationships between any datasets, and can use this information to determine optimized data structures to create. In particular, the relationships between virtual and physical datasets can influence which optimized data structures the platform will create. The platform can use cost-based factors or heuristics. For example, a minimum-cost threshold can be applied to avoid materializing data if the cost of computing that data on-the-fly is low (e.g., performed quickly). For example, if an underlying dataset is materialized and a derived dataset can be computed based on the underlying dataset at a low enough cost, then the derived dataset is not materialized even if the user requests to materialize the derived dataset.

A delta materialization refers to when a virtual dataset has most of itsdata in common with another physical or virtual dataset that has beenmaterialized. If so, the platform may decide to materialize only thedifference between the two datasets (i.e., the delta). For example, if Ais already materialized and B is identical to A plus an additionalcolumn, then the platform will only materialize the additional column inB to accelerate B. In some embodiments, the platform can generatematerialization patterns and an ordering that minimizes materializationmaintenance cost. This includes materialization updates that can bepartitioned and are data segment aware, as well as data algebra aware.
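The A and B example can be sketched at the column level. The function name `delta_columns` and the column names are hypothetical, and the sketch assumes that B is exactly A plus additional columns; other kinds of differences (e.g., differing rows) are not handled.

```python
from typing import List


def delta_columns(materialized: List[str], target: List[str]) -> List[str]:
    """Return the columns of `target` that are not already materialized.

    Assumes `target` is `materialized` plus extra columns; raises otherwise.
    """
    already = set(materialized)
    if not already <= set(target):
        raise ValueError("target is not a column-wise superset of the materialized dataset")
    return [col for col in target if col not in already]


a_columns = ["business_id", "name", "city"]
b_columns = ["business_id", "name", "city", "avg_rating"]
to_materialize = delta_columns(a_columns, b_columns)
```

Here only `avg_rating` would need to be materialized to accelerate B, with the remaining columns served from the existing materialization of A.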

A common ancestor materialization refers to automatically detecting a common ancestor for multiple virtual datasets or ad hoc queries that need to be accelerated, provided that the cost of computing these datasets or queries from the ancestor is relatively small. For example, a user may ask to accelerate B and C, but not A. The platform may determine that B and C can be derived from A at a sufficiently low computational cost, and therefore decide to materialize A but not B or C.

The common ancestor need not be an actual named dataset; rather, it may be any arbitrary intermediate dataset (which may or may not be expressible in SQL). For example, assume that B is ‘SELECT COUNT(*) FROM business GROUP BY state HAVING state=“CA”’ and C is ‘SELECT COUNT(*) FROM business GROUP BY state HAVING state=“NY”’. In this instance, the system may decide that instead of materializing both B and C, it is sufficient to materialize an intermediate dataset equivalent to “SELECT state, COUNT(*) FROM business GROUP BY state” because B and C can be cheaply computed based on this intermediate dataset.
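The B and C example can be reproduced with an in-memory stand-in for the intermediate dataset. The sample rows and the helper name `count_for_state` are invented for illustration; the point is only that both results are simple lookups once the per-state counts exist.

```python
from collections import Counter

# Invented sample rows standing in for the `business` dataset.
business = [
    {"name": "Cafe A", "state": "CA"},
    {"name": "Diner B", "state": "NY"},
    {"name": "Bakery C", "state": "CA"},
    {"name": "Grill D", "state": "TX"},
]

# Intermediate dataset: SELECT state, COUNT(*) FROM business GROUP BY state
intermediate = Counter(row["state"] for row in business)


def count_for_state(per_state_counts, state):
    """Answer '... GROUP BY state HAVING state = <state>' from the intermediate data.

    Returns None when the state has no rows, mirroring an empty result set.
    """
    return per_state_counts.get(state)


b_result = count_for_state(intermediate, "CA")  # dataset B
c_result = count_for_state(intermediate, "NY")  # dataset C
```

Materializing the single intermediate dataset therefore serves B, C, and any analogous query for another state.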

An intermediate materialization refers to not only identifying explicit datasets that can be accelerated but also generalizing datasets or creating intermediate datasets that did not previously exist to provide ideal acceleration candidates. This can be done to balance materialization coverage and size with required latencies considering factors such as result execution effort, cluster size, and the desire to improve hit ratios for the autonomous memory.

An automatic partition and ordering materialization technique refers toautomatically identifying optimal physical data layouts includingpartitioning, dictionary encoding, and ordering to provide benefits toboth materialization maintenance, and results execution performance.

A columnar-aware acceleration refers to the platform taking advantage ofthe nature of columnar storage formats to accelerate alternativedatasets. This allows the platform to reduce dataset duplication andminimize disk read overheads. By understanding data at a column level,the platform can provide optimal performance for multiple relateddatasets in a single physical data representation. This technique canalso include stepwise distinct materialization maintenance steps. Thiscan accelerate availability of common transformation patterns anddecrease staleness. By doing this, some columns can be updated morequickly compared to other common transformations that can be deferred ormaintained at a decreased frequency.

A CPU-consumption and size-based tiered materialization technique refersto materialization that targets multiple types of storage includingspinning disk, SSD, compressed memory, expanded memory, as well as newertechnologies including persistent memory. In each instance, the platformallows a user to balance the need for execution performance with thecapacity of storage subsystems to provide an optimal scale orperformance balance.

An automatic identification of OLAP-type patterns technique refers to identifying clusters of common analysis patterns done through SQL or third party tools, along with data profiling, such that the platform can determine likely relations including table relations, star-schemas, and dimension measures, and use these to generate better materialization coverage and support automatic dataset identification and caching in the autonomous memory.

A secondary storage redundancy technique refers to maintaining aconnection to primary data sources. The platform can minimizeduplication for redundancy purposes by having a strong awareness ofprimary dataset availability.

In some embodiments, the platform can generate a large number ofalternative materializations. In this instance, the platform needs toevaluate a large number of possible supporting optimized data structuresduring each query execution. To ensure the performance completion of thealternative evaluation and costing, the platform can generate analgebraic data tree structure, which refers to an in-memory and on-diskdata structure that allows for high performance evaluation andcomparison of multiple optimized data structures for incorporation. Thiscan be supplemented using common alternative caching to provide optimalquery completion performance.

Business Intelligence Tool Launcher

In some embodiments, users are enabled to collaboratively manage andprepare (i.e., curate) data via virtual datasets. By leveraging theplatform's SQL execution engine and client-side drivers, any SQL-basedapplication (e.g. BI tool) can connect to the platform and issue querieson physical and virtual datasets.

The platform may include capabilities that enable users to jump fromdata preparation inside the platform's UI to analysis with a SQL-basedapplication. This capability may be referred to as a BI tool launcher.The user can choose the desired application and click a button to launchthat application. Data is not extracted or exported from the platforminto the BI tool. Instead, the application is initiated with theparameters needed to make a live, direct connection to the platform.

The BI tool can be integrated into the platform using differenttechniques. For example, an auto-generated connection file can bedownloaded to the platform. The auto-generated connection file caninitialize the BI tool with a direct connection. For example, in thecase of Tableau, a TDS file can be created and downloaded by the user.In another example, a URL handler or browser plugin can be used. Whenthe user clicks the button (or link), the browser receives theconnection information and launches the BI tool. In yet another example,the platform can use a direct link or API calls. In the case ofserver-side or web-based BI tools, the user can be redirected to aspecial URL which includes the connection information. Alternatively,the platform can use the BI tool's API to configure the connection.

Computer System

FIG. 9 is a block diagram of a computer system 900 as may be used toimplement certain features of some of the embodiments. The computersystem 900 may be a server computer, a client computer, a personalcomputer (PC), a user device, a tablet PC, a laptop computer, a personaldigital assistant, a cellular telephone, an IPHONE, an IPAD, aBLACKBERRY, a processor, a telephone, a web appliance, a network router,switch or bridge, a console, a hand-held console, a (hand-held) gamingdevice, a music player, any portable, mobile, hand-held device, wearabledevice, or any machine capable of executing a set of instructions(sequential or otherwise) that specify actions to be taken by thatmachine.

The computing system 900 may include one or more central processing units (“processors”) 902, memory 904, input/output devices 906, e.g. keyboard and pointing devices, touch devices, display devices, storage devices 908, e.g. disk drives, and network adapters 910, e.g. network interfaces, that are connected to an interconnect 912. The interconnect 912 is illustrated as an abstraction that represents any one or more separate physical buses, point to point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect 912, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called FIREWIRE.

The memory 904 and storage devices 908 are computer-readable storagemedia that may store instructions that implement at least portions ofthe various embodiments. In addition, the data structures and messagestructures may be stored or transmitted via a data transmission medium,e.g. a signal on a communications link. Various communications links maybe used, e.g. the Internet, a local area network, a wide area network,or a point-to-point dial-up connection. Thus, computer readable mediacan include computer-readable storage media, e.g. non-transitory media,and computer-readable transmission media.

The instructions stored in memory 904 can be implemented as softwareand/or firmware to program the processor 902 to carry out actionsdescribed above. In some embodiments, such software or firmware may beinitially provided to the processing system 900 by downloading it from aremote system through the computing system 900, e.g. via network adapter910.

The various embodiments introduced herein can be implemented by, forexample, programmable circuitry (e.g., one or more microprocessors)programmed with software and/or firmware, or entirely in special-purposehardwired (non-programmable) circuitry, or in a combination of suchforms. Special-purpose hardwired circuitry may be in the form of, forexample, one or more ASICs, PLDs, FPGAs, etc.

The above description and drawings are illustrative and are not to beconstrued as limiting. Numerous specific details are described toprovide a thorough understanding of the disclosure. However, in certaininstances, well-known details are not described in order to avoidobscuring the description. Further, various modifications may be madewithout deviating from the scope of the embodiments.

Reference in this specification to “one embodiment” or “an embodiment”means that a particular feature, structure, or characteristic describedin connection with the embodiment is included in at least one embodimentof the disclosure. The appearances of the phrase “in one embodiment” invarious places in the specification are not necessarily all referring tothe same embodiment, nor are separate or alternative embodimentsmutually exclusive of other embodiments. Moreover, various features aredescribed which may be exhibited by some embodiments and not by others.Similarly, various requirements are described which may be requirementsfor some embodiments but not for other embodiments.

The terms used in this specification generally have their ordinarymeanings in the art, within the context of the disclosure, and in thespecific context where each term is used. Certain terms that are used todescribe the disclosure are discussed above, or elsewhere in thespecification, to provide additional guidance to the practitionerregarding the description of the disclosure. For convenience, certainterms may be highlighted, for example using italics and/or quotationmarks. The use of highlighting has no influence on the scope and meaningof a term; the scope and meaning of a term is the same, in the samecontext, whether or not it is highlighted. It will be appreciated thatthe same thing can be said in more than one way. One will recognize that“memory” is one form of a “storage” and that the terms may on occasionbe used interchangeably.

Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.

Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given above. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.

What is claimed is:
 1. A method performed by one or more server computers, the method comprising: receiving a query by the one or more server computers; generating a query plan in memory based on the received query, wherein the query plan includes computer instructions configured to recursively substitute references to a plurality of virtual datasets with references to physical datasets contained in a plurality of data sources; detecting an optimized data structure in an autonomous memory, wherein the optimized data structure has not expired and can be used to partially or entirely modify the query plan to satisfy the query; determining that processing of the received query can be accelerated by substituting at least a portion of the query plan to include a reference to the optimized data structure stored in the autonomous memory to partially or entirely satisfy the received query, wherein the optimized data structure is a raw reflection or an aggregation reflection, and wherein determining that the received query can be accelerated comprises determining that a type of reflection of the optimized data structure satisfies the query cost-effectively, and wherein the optimized data structure is based on results from one or more previously processed queries and/or includes pre-existing data; in response to determining that processing the received query can be accelerated, modifying the query plan to refer to the optimized data structure instead of referring to at least one of the physical datasets referred to in the query plan; and executing the modified query plan to obtain query results by: accessing the autonomous memory to utilize the optimized data structure instead of accessing a physical dataset of the plurality of virtual datasets; and accessing at least some of the plurality of data sources to retrieve data of the plurality of virtual datasets to satisfy the received query.
 2. The method of claim 1, further comprising, prior to receiving the query: generating the optimized data structure to include raw data of at least one of the plurality of datasets to result in the raw reflection.
 3. The method of claim 1, further comprising, prior to receiving the query: generating the optimized data structure to include an aggregation of one or more data columns of at least one of the plurality of datasets to result in the aggregation reflection.
 4. The method of claim 2, wherein the raw reflection comprises at least one of sorted, partitioned, or distributed data of one or more data columns of the at least one of the plurality of datasets.
 5. The method of claim 1, further comprising, prior to receiving the query: generating the optimized data structure to include data sampled from at least one of the plurality of datasets.
 6. The method of claim 1, wherein the received query is a second query and the query results are second query results, the method further comprising, prior to receiving the second query: generating the optimized data structure based on first query results that satisfy a first query.
 7. The method of claim 6, wherein the query plan is a second query plan, and a first query plan is defined to have a scope broader than necessary for obtaining search results satisfying the first query such that the generated optimized data structure is broader than an optimized data structure generated based on a query plan having a scope that is minimally sufficient for obtaining search results satisfying the first query.
 8. The method of claim 1, wherein the query results are obtained without reading any of the plurality of datasets contained in the plurality of data sources.
 9. The method of claim 1, wherein the query results are obtained by reading at least some of the plurality of datasets contained in the plurality of data sources in addition to reading the optimized data structure.
 10. The method of claim 1, further comprising, prior to determining that the received query can be accelerated: autonomously deciding to generate the optimized data structure.
 11. The method of claim 10, wherein the decision to generate the optimized data structure is based on a history of queries received by the one or more server computers.
 12. The method of claim 10, wherein the decision to generate the optimized data structure is based on a determination that reading the optimized data structure in lieu of reading the at least some data from the plurality of data sources improves processing of an expected workload.
 13. The method of claim 1, further comprising, prior to receiving the query: receiving user input requesting acceleration of queries on one or more datasets of the plurality of datasets; and generating the optimized data structure in response to the received request.
 14. The method of claim 1, the method further comprising, prior to receiving the query: receiving user input defining a virtual dataset derived from a physical dataset contained in the plurality of data sources, wherein the plurality of datasets includes the virtual dataset.
 15. The method of claim 1, wherein the modified query plan is only executed by a distributed query engine of the one or more computer servers.
 16. A computer system comprising: a processor; and an autonomous memory containing instructions that, when executed by the processor, cause the computer system to: connect to a plurality of data sources that contain a plurality of physical datasets; cause display of a visual dataset editor; allow a plurality of users to curate data by using the visual dataset editor to create a plurality of virtual datasets derived from the plurality of physical datasets without creating any physical copies of the curated data; autonomously generate an optimized data structure based on a physical dataset contained in the plurality of data sources, wherein the optimized data structure is a raw reflection or an aggregation reflection, and wherein the optimized data structure is based on results from one or more previously processed queries and/or includes pre-existing data; store the optimized data structure in the autonomous memory, the optimized data structure in the autonomous memory configured to be associated with a time of expiration; detect the time of expiration has not expired; in response to detecting that the expiration time has not expired, substitute at least a portion of a query plan to refer to the optimized data structure stored in the autonomous memory instead of referring to the physical dataset referred to in the query plan; determine that a type of reflection of the optimized data structure satisfies the query cost-effectively, wherein the query plan includes computer instructions configured to recursively substitute references to the virtual dataset with references to the physical dataset contained in the plurality of data sources, wherein the raw reflection includes raw data of the physical dataset or the virtual dataset derived from the physical dataset for compiling the query, and wherein the aggregation reflection includes an aggregation of one or more data columns of at least one of the physical dataset or the virtual dataset derived from the physical dataset for satisfying the query; and execute the query plan to obtain query results by: accessing the autonomous memory to utilize the optimized data structure instead of accessing the physical dataset; and accessing at least some of the plurality of data sources to retrieve data of the plurality of virtual datasets to satisfy the query.
 17. The system of claim 16, wherein the plurality of virtual datasets are exposed as tables in client applications.
 18. The system of claim 16 further caused to: allow the plurality of users to share the plurality of virtual datasets via the visual dataset editor.
 19. The system of claim 16, wherein the visual dataset editor includes a control that upon being selected by a user causes the computer system to open a client application connected to a virtual dataset.
 20. The system of claim 16 further caused to: cause display of a visualization indicative of relationships between a plurality of physical datasets and a plurality of virtual datasets.
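
By way of non-limiting illustration only, the plan-substitution steps recited above can be sketched in a few lines of Python. The sketch is an assumption-laden toy and not the disclosed implementation: the names VIRTUAL, Reflection, resolve, plan_query, and accelerate, the dataset names, and the simple rule standing in for the "cost-effective" determination (an aggregation reflection serves only aggregate queries) are all hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass

# Hypothetical catalog: each virtual dataset maps to the dataset it derives from.
VIRTUAL = {"sales_clean": "sales_typed", "sales_typed": "s3.sales_raw"}

@dataclass(frozen=True)
class Reflection:
    name: str
    source: str        # physical dataset the reflection was derived from
    kind: str          # "raw" or "aggregation"
    expires_at: float  # time of expiration (epoch seconds)

def resolve(dataset: str) -> str:
    """Recursively substitute a virtual dataset reference with its physical dataset."""
    parent = VIRTUAL.get(dataset)
    return dataset if parent is None else resolve(parent)

def plan_query(datasets, aggregate: bool) -> dict:
    """Build a toy query plan: a list of physical scans plus an aggregate flag."""
    return {"scans": [resolve(d) for d in datasets], "aggregate": aggregate}

def accelerate(plan: dict, reflections, now: float) -> dict:
    """Return a modified plan that reads unexpired, suitable reflections
    instead of the physical datasets they were derived from."""
    def pick(scan: str) -> str:
        for r in reflections:
            usable = r.source == scan and r.expires_at > now
            # Assumed stand-in for the cost check: raw reflections serve any
            # query; aggregation reflections serve only aggregate queries.
            if usable and (r.kind == "raw" or plan["aggregate"]):
                return r.name
        return scan  # no usable reflection: keep reading the data source
    return {**plan, "scans": [pick(s) for s in plan["scans"]]}

reflections = [
    Reflection("refl_sales_agg", "s3.sales_raw", "aggregation", expires_at=200.0),
    Reflection("refl_customers", "db.customers", "raw", expires_at=50.0),
]
plan = plan_query(["sales_clean", "db.customers"], aggregate=True)
fast_plan = accelerate(plan, reflections, now=100.0)
```

In this toy run the virtual dataset sales_clean resolves through sales_typed to s3.sales_raw, which is then replaced by the unexpired aggregation reflection, while db.customers is still read from its data source because its reflection has expired. The modified plan therefore reads an optimized data structure and a data source together, in the manner of the executing step of claim 1.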